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Abstract 

In computational evolutionary biology, verification and benchmarking is a challenging task because the evolutionary his- 
tory of studied biological entities is usually not known. Computer programs for simulating sequence evolution in silico have 
shown to be viable test beds for the verification of newly developed methods and to compare different algorithms. However, 
current simulation packages tend to focus either on gene-level aspects of genome evolution such as character substitutions 
and insertions and deletions (indels) or on genome-level aspects such as genome rearrangement and speciation events. Here, 
we introduce Artificial Life Framework (ALF), which aims at simulating the entire range of evolutionary forces that act on 
genomes: nucleotide, codon, or amino acid substitution (under simple or mixture models), indels, GC-content amelioration, 
gene duplication, gene loss, gene fusion, gene fission, genome rearrangement, lateral gene transfer (LGT), or speciation. The 
other distinctive feature of ALF is its user-friendly yet powerful web interface. We illustrate the utility of ALF with two possible 
applications: 1) we reanalyze data from a study of selection after globin gene duplication and test the statistical significance of 
the original conclusions and 2) we demonstrate that LGT can dramatically decrease the accuracy of two well-established or- 
thology inference methods. ALF is available as a stand-alone application or via a web interface at http://www.cbrg.ethz.ch/alf. 
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Introduction 

To unravel evolutionary relations among single molecu- 
lar characters, genes, genomes, and species, computational 
evolutionary biology methods typically infer past events 
from current data. Because of the inherently unknown na- 
ture of these past events, model and method validation and 
comparison is notoriously difficult. Computer programs 
that simulate evolution in silico provide viable test beds 
to understand and characterize new models and methods. 
Since simulations rely on simplifying models, they neces- 
sarily lack realism. Nevertheless, they provide a way — often 
the only way — to investigate and test evolutionary models, 
algorithms, and implementations under controlled condi- 
tions. Thus, validation by simulation is often considered an 
insufficient but necessary step to propose a new method. 

Programs to simulate biological sequences can be 
divided into two main categories. In population genet- 
ics, simulation takes into account changes within and 
across populations of individuals that arise by models 
of sex, recombination, or gene conversion. Simulation is 
performed backward in time under the coalescent (e.g., 
Hudson 2002; Spencer and Coop 2004; Schaffner et al. 
2005) or forward in time (e.g., Peng and Kimmel 2005; 
Hoggart et al. 2007; Chadeau-Hyam et al. 2008; Hernandez 
2008; O'Fallon 2010; Peng and Liu 2010). In phylogenetics 
and evolutionary biology, simulation involves single rep- 
resentatives of species related by a tree and is performed 
forward in time. In this latter context, various simulation 
programs have been developed for different evolutionary 
models. Most of them simulate evolution at the gene or 



protein sequence level, as opposed to the genome level. 
Darwin (Gonnet et al. 2000) offers functions for mutating 
sequences along a branch, including gaps. PAML evolver 
(Yang 1997), Seq-Gen (Rambaut and Grassly 1997), and its 
extensions PSeq-Gen (Grassly et al. 1997) and CS-PSeq-Gen 
(Tuffery 2002) were among the first popular programs that 
allowed synthetic evolution of a DNA (or protein) sequence 
along a given tree. They include support for several mod- 
els of nucleotide substitution as well as site-specific rate 
heterogeneity. None of these programs support insertions 
and deletions (indels). Only very recently indel simulation 
was included alongside the point mutation process and ad- 
ditional sequence features such as variable over time mu- 
tation rates or sequence motifs: EvolveAGene (Hall 2008), 
MySSP (Rosenberg 2005), Hetero (Jermiin et al. 2003), indel- 
Seq-Gen (Strope et al. 2007), Rose (Stoye et al. 1998). An- 
other program, SISSI (Gesell and Haeseler 2006), simulates 
site-specific interactions. Although most programs typically 
use biologically very simple indel simulation, some programs 
also incorporate more advanced models for indel forma- 
tion and distribution: SIMPROT (Pang et al. 2005), Dawg 
(Cartwright 2005), and INDELible (Fletcher and Yang 2009). 
Additionally, PhyloSim (Sipos et al. 2011) simulates com- 
plex rate variations and selective constraints with multiple 
substitution and indel processes. To our knowledge, Evol- 
Simulator (Beiko and Charlebois 2007) is the only program 
to go beyond single sequence simulation and allowing ge- 
nomic effects such as gene duplication or lateral gene trans- 
fer (LGT). This program, however, is limited in the choice 
of evolutionary models and lacks support for insertions and 
deletions at the sequence level. 
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Here, we introduce Artificial Life Framework (ALF), which 
we developed with the long-term goal of simulating the en- 
tire range of evolutionary forces that act on genomes. In 
this first release, we primarily focus on species-level evolu- 
tion. ALF evolves an ancestral genome, represented by an 
ordered set of sequences, along a tree into a number of de- 
scendant synthetic genomes. At the level of a gene, ALF can 
simulate evolution at the nucleotide, codon, or amino acid 
level with indels and among-site rate heterogeneity and sup- 
ports most established models of character substitution. To 
mimic different types of sequences (e.g., coding sequence vs. 
noncoding, functional genes vs. pseudogenes, etc.), multiple 
sequence classes can be defined, each with their own mod- 
els of substitution, insertion-deletion, and among-site rate 
variation. At a more global genomic level, ALF can simulate 
GC-content amelioration (Lawrence and Ochman 1997), 
gene duplication and loss, genome rearrangements, several 
types of LGT, and speciation events. The user can provide 
a starting (ancestral) genome or have one generated ran- 
domly. The output consists of the simulated genomes, mul- 
tiple sequence alignments, and gene trees of all gene fami- 
lies, all ancestral sequences, the true species tree including 
LGTs, and for each pair of genomes the sets of orthologous, 
paralogous, and xenologous sequences. In future releases of 
ALF, we aim to incorporate more evolutionary models, in- 
cluding population-level events such as recombination. 

This article is organized as follows. We first provide an 
overview of ALFs architecture and briefly describe how to 
set up a simulation scenario via ALFs web interface. We then 
summarize various control experiments conducted to en- 
sure ALFs correctness. Finally, we present applications of 
ALF to two bioinformatics problems: First, we reanalyze data 
from a study of selection after globin gene duplication and 
test the statistical significance of the original conclusions; 
second, we demonstrate that LGT can dramatically decrease 
the accuracy of two well-established orthology inference 
methods. 

Methods and Materials 

Overview of the Simulation Process 
ALF generates a set of species genomes starting from a sin- 
gle ancestral genome sequence. The ancestral genome may 
be represented by biological sequences supplied by the user 
or generated randomly according to user specifications. A 
species tree may also be specified by the user or randomly 
generated. In the course of the simulation, ALF evolves the 
root genome along the tree, where each node defines a spe- 
ciation event. The emerging genomes are exposed to the 
evolutionary processes implemented in ALF. 

Figure 1 gives a graphical overview of the ALF simula- 
tion pipeline. Character substitutions occur according to 
the substitution probability matrix of a selected amino acid, 
codon, or nucleotide model for a given branch length. Dif- 
ferent models can be specified for simulation, for exam- 
ple, one codon and one nucleotide model could be used 
to distinguish coding and noncoding regions, respectively. 
The rate of substitution can differ over sites and genes. ALF 



root genome 




speciation 




birth-death process 




random subtree of ToL 




user-defined 




character substitution 




DNA, codon or AA models 

DNA: F84, HKY85.TN93, GTR 
Codon: CodonPAM, ECM, M-series 
AA: PAM, JTT, LG, WAG 




GC amelioration 




rate heterogeneity 

Gamma, Poisson, user-defined 




indel events 




random location 




length distributions 

Zipfian, GQG, neg. bin., user-def. 




gene duplication/loss 




single/multiple genes 




tandem duplications 




translocation 




lateral gene transfer 




novel acquisition 




orthologous replacement 




genome rearrangement 




inversion 




translocation 




inverted translocation 




gene fusion/fission 



final genomes 
ancenstral genomes 
all gene trees 
true tree w/LGTs 
true MSAs 

set of ortho-/para-/xenologs 



Fig. 1. Overview of the ALF simulation process. A root genome is 
evolved along a species tree. Events at the site, sequence and genome 
level are simulated iteratively. 



allows for each species to have its own underlying equilib- 
rium base frequencies, for instance, to simulate drift toward 
species-specific GC content. 

The simulation of other evolutionary processes is based 
on Gillespie's algorithm with exponential waiting times 
(Gillespie 1977), providing for realistic scenarios with 
parallel simulations of events at the sequence and genome 
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level. Indel events occur additionally and independently 
of substitutions with separate rates for indels. Indels that 
would disrupt protein function by introducing frame shifts 
are not allowed when simulating coding DNA. In that case, 
ALF only simulates in-frame indels of nucleotide triplets or 
codons, and insertions will not contain any stop codons. 
Gene duplications, gene losses, and LGT alter the content 
of each genome. All three types of events can affect sin- 
gle or multiple genes. Genes in a genome can also be af- 
fected by gene fusion events that join neighboring genes 
into one and fission events that break an existing gene into 
two. Finally, rearrangement events lead to the shuffling of 
the genes within a genome. 

The simulation starts by reading or generating a root 
genome and a species tree that defines speciation times. 
Then, gene- and genome-level events are generated on each 
branch by drawing a random waiting time from an exponen- 
tial distribution using the total rate of all events. An event 
to occur after that time interval is selected based on its rel- 
ative rate. Sequence level events (substitutions and indels) 
are delayed until a gene/genome level event depends on 
their execution or until the end of each branch. First, substi- 
tutions are performed using a substitution probability ma- 
trix. Afterward, indels are generated in a similar manner as 
gene/genome level events. 

ALF is highly configurable, allowing the simulation of ei- 
ther all or an arbitrary subset of evolutionary events. Size and 
number of the resulting organisms is only limited by com- 
putational power. For example, evolving 20 genomes based 
on the coding sequences of Escherichia coli (4,352 sequences 
with a total length of 1,368,902 codons), using the empirical 
codon model by Schneider et al. (2005) takes about 4.5 h on 
an single Intel Xeon core at 2.26 GHz. 

Evolutionary Events 

Speciation 

Speciation events in ALF occur according to a species tree, 
fixed prior to simulation. ALF offers three options for obtain- 
ing this tree: 1) A tree is sampled according to the birth- 
death process with parameters A and \i as described by 
Gernhard (2008); because the resulting trees are ultramet- 
ric, an exponentially distributed deviation is applied to each 
branch according to Guindon and Gascuel (2003), 2) a tree 
can be obtained by sampling uniformly from a variance 
weighted least squares tree on 1,038 species based on data 
from the OMA project (Dessimoz and Gil 2008), and 3) A 
custom tree (e.g., a priori inferred) may be specified by the 
user in Newick or DARWIN format. 

At a speciation event, the two new species inherit the 
ancestral genome while the genome-specific parameters 
such as target GC content are adapted. As the simulation 
progresses, the genomes of the two species evolve indepen- 
dently and start undergoing different mutations and accu- 
mulating differences. 

Sequence Types 

For each segment of the root genome, a sequence type can 
be specified, which is defined by a substitution model, an 



indel model, and a model for rate heterogeneity among 
sites as described below. Switches between types can oc- 
cur at gene duplication and speciation events, as specified 
by the user. Sequence types allow for simultaneous simula- 
tion of sequences with different characteristics, for example, 
for coding and noncoding sequences. 

Substitution 

The user can choose to simulate nucleotide, codon, and/or 
amino acid substitution. When simulating codon substitu- 
tion, mutations that lead to the formation of a stop codon 
(nonsense mutations) are ignored. Although stop codons 
may occur in nature, in particular near the end of a se- 
quence, they usually have a deleterious effect on protein 
function and will be selected against. 

Nucleotide Substitution Models. Nucleotide substitution is 
simulated using one of four well-known Markov models. 
The HKY (Hasegawa et al. 1985) and F84 (Felsenstein and 
Churchill 1996) models allow for a different rate of transi- 
tions and transversions as well as unequal base frequencies. 
The TN93 model (Tamura and Nei 1993) is more general 
in that it models transitions with two parameters. Other 
simpler models, such as JC and K80, can be viewed as spe- 
cial cases of TN93. Finally, model GTR (Tavare 1986) is the 
most general time reversible. In addition to the equilibrium 
base frequencies, it allows for six different rate parameters 
describing each type of substitution. 

Codon Substitution Models. Codon models recently came 
into prominence, but very few programs allow simulation of 
the codon substitution process (for a review, see Anisimova 
and Kosiol 2009). ALF enables the simulation of protein- 
coding sequences under a range of codon models: the para- 
metric site models M0, M2, M3, and M8 with variable selec- 
tion pressure over sites (Yang et al. 2000), empirical codon 
models derived by Schneider et al. (2005) and Kosiol et al. 
(2007). Alternatively, a user-specified matrix can be used. 

Amino Acid Substitution Models. For simulations at the 
amino acid level, ALF offers seven substitution models. 
These include PAM (Dayhoff et al. 1978), Gonnet (Gonnet 
et al. 1992), JTT (Jones et al. 1992), WAG (Whelan and 
Goldman 2001), LG (Le and Gascuel 2008), as well as two 
models for ordered and disordered proteins (Szalkowski and 
Anisimova 2011). The user can also choose to provide a 
custom exchangeability matrix and frequencies. 

Rate Heterogeneity, Domains, and Motifs. Not all genes in an 
organism mutate at the same rate. Likewise, within genes, 
different regions or sites may evolve faster or slower — for ex- 
ample, transmembrane regions or active sites are known to 
evolve fast or slow, respectively. ALF acknowledges this fact 
in two ways. On the genome level, the overall rate of each 
gene can be modified by a factor drawn from the Gamma 
distribution (J?) with parameter a g . This factor also affects 
the indel rates. On the sequence level, ALF supports the 
Gamma model with invariable sites (T + I) (Gu et al. 1995). 
Alternatively, the sequence can be divided into a number of 
domains, all having their own mutation rate. The number 
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of domains for a gene is chosen at random from a uniform 
distribution between 1 and a user-defined maximal value, 
and the mutation rate for each domain is drawn from the 
Poisson distribution with user-specified mean. 

When real sequence data are used for the root genome, 
the user has the possibility to provide custom rates. This tai- 
lors for simulating the evolution of well-characterized pro- 
teins or can be used for testing the robustness of estimates 
from empirical studies. Since domains can be as small as sin- 
gle amino acids and the mutation rate can be set to zero, it 
is possible to construct strictly conserved motifs. 

GC-Content Amelioration. When ALF is used to simulate 
evolution at the nucleotide or codon level, differences in 
GC content can be simulated in two ways. One possibil- 
ity is to switch to different substitution models specified by 
the user at the internal nodes of the species tree. The other 
possibility is to assign target nucleotide frequencies to the 
genomes at the leaves of the species tree, either globally for 
all substitution models or for each substitution model sepa- 
rately. During the simulation, these frequencies are used to 
compute the mutation matrix and to create sequence frag- 
ments for insertions. As a consequence, GC content con- 
verges over time to the target value defined for the leaves 
of the tree. The actual nucleotide frequencies follow the for- 
mula 7r t = 7T 0 • e Qt , where tt 0 and 7r t are the base frequencies 
at the beginning and end of the branch, respectively, and the 
target frequencies are used in the rate matrix Q (Yang and 
Roberts 1995). The target frequencies for the internal nodes 
up to the root are computed as averages weighted by the 
lengths of the outgoing branches. All genes within an organ- 
ism (using the same substitution model) share the same GC 
content. However, LGT may work against GC amelioration, 
keeping the GC content different for some genes. 

Target frequencies can be set globally or per substitution 
model, and the user has the choice of having target frequen- 
cies generated randomly or supplying his own. 

Indel Formation 

Length distributions and rates can be specified separately 
for indels. Several possibilities for modeling the indel length 
distribution have been implemented. The first model uses 
the negative binomial distribution, which takes two pa- 
rameters, an integer (r) and a proportion (q). By setting 
r = 1, the distribution is geometric and equivalent to the 
affine scoring model with gap open and gap extension costs 
(Gotoh 1982). 

The second method models indel lengths using the Zip- 
fian distribution with one parameter (Benner et al. 1993; 
Chang and Benner 2004). 

Another available model is the generalized Qian- 
Goldstein indel length distribution (Qian and Goldstein 
2001), which uses a mixture of exponentials and adapts to 
the branch lengths. Finally, the user can specify a customized 
general discrete distribution. 

In order to cut the tail of the chosen indel length distri- 
bution, a maximum for the length of an indel can be de- 
fined at which the distribution is truncated. For insertions, 



the content of the inserted segment is drawn from the equi- 
librium character distribution for the species. 

ALF offers two ways of placing deletions within the 
sequence: 1) Following Cartwright's model in Dawg 
(Cartwright 2005), each sequence is considered to be 
embedded in a longer virtual sequence. Deletions affect- 
ing the simulated sequence can start and/or end in this 
virtual sequence. A deletion of length in a sequence of 
length L s can therefore start at any position in the interval 
[— + IjLj. This method ensures equal deletion proba- 
bility for all sites but biases the deletion length distribution 
toward smaller deletions because gaps overlapping with the 
beginning or end of sequence get truncated. 2) Deletions 
are placed within the simulated sequence in their entirety. 
Given a deletion length L d and sequence length L s , the dele- 
tion can start at any position in the interval [1, L s — L p + 1]. 
This way, the length of deletions matches more closely the 
distribution specified, but with this strategy, the deletion 
rate is not uniform over the entire sequence: the deletion 
probability for sites close to the ends of the sequence is 
lower than in the middle. 

Gene Duplication and Gene Loss 

Gene duplication and gene loss events occur randomly and 
in parallel to the evolutionary events at the sequence level. 
They can comprise one or several consecutive randomly se- 
lected sequences up to a user-defined maximum. The new 
copies from a duplication event are inserted as new se- 
quences either directly after their originals or at a random 
position in the genome. Furthermore, ALF accounts for neo- 
functionalization (Ohno 1970) and subfunctionalization 
(Lynch and Conery 2000) by temporarily altering the muta- 
tion rate of duplicates or both, originals and duplicates, by 
a user-defined amount. In the case of neofunctionalization, 
this behavior corresponds to the idea that an organism can 
afford to mutate a copied gene at a higher rate while still 
maintaining the original function. The concept of subfunc- 
tionalization assumes both copies of a duplicated gene to 
be freed of evolutionary constraints and evolve in a comple- 
mentary fashion. 

Similarly, one or more genes are removed from the 
genome in a gene loss event. 

Lateral Gene Transfer 

Apart from gene duplication, LGT is the second evolution- 
ary process allowing a genome to acquire new genes. ALF im- 
plements two kinds of LGT: orthologous replacement and 
"novel acquisition" (Doolittle et al. 2003). In the first case, 
the newly acquired gene replaces an orthologous gene in the 
recipient. In the second case, the new gene is added to the 
recipient genome without any replacement. For "novel ac- 
quisition," not only a single gene but a whole group of genes 
can be transferred. 

The donor and recipient genomes as well as the genes to 
be transferred are chosen at random. In the case of "novel 
acquisition," the transferred genes are inserted at a random 
position in the recipient's genome. With orthologous re- 
placement, only genes that exist in both genomes can be 
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transferred, and the transferred gene replaces its ortholog 
in the recipient. 

Gene Fusion and Fission 

ALF supports gene fission events that break a gene into two 
new sequences, either at a random location or at domain 
borders. Fusion events merge two or more existing genes 
into one new sequence. 

Genome Rearrangement 

In nature, the gene positions within a genome are not 
fixed. Several mechanisms may cause gene translocations 
or changes in the direction of the coding strand (Sankoff 
and Nadeau 2003). In ALF, these genome rearrangements 
are modeled as two independent phenomena. 

First, genes can be moved to a different position within 
the genome. Such translocations occur at a user-defined 
rate and move a random number of consecutive genes to 
a random position within the genome. 

The second process is the inversion of a segment of the 
genome, which results from the replacement of a segment 
of DNA by its reverse complement. Translocations and in- 
versions can also occur simultaneously, leading to so-called 
inverted translocations. 

Program Output 

Although the probabilities of all evolutionary processes can 
be adjusted, all events occur at random positions at ran- 
dom points in time. Running ALF multiple times with the 
same parameters will therefore lead to generating different 
genomes with different histories in each run. 

ALF saves the genomes of all species that arise during the 
simulation, including the ancestors of the final species, as a 
DARWIN (Connet et al. 2000) database containing all pro- 
tein and DNA sequences (for simulations at the codon or 
nucleotide level) and their evolutionary history. Addition- 
ally, the sequences are saved in the FASTA format. Further- 
more, true alignments and evolutionary histories (including 
all LGTs) are recorded during the simulation process for all 
gene families. A set of all orthologous genes of a gene family 
can also be assembled. 

Web Interface and Stand-Alone Version 
A web interface for ALF is available at http://www.cbrg. 
ethz.ch/alf. It provides the user with an intuitive way for set- 
ting simulation parameters and is organized hierarchically, 
reflecting the level upon which each force acts (fig. 1). To 
make usage of ALF as simple as possible, the user can rely 
on contextual help for all parameters as well as a selection 
of presets for typical applications (including those outlined 
below). The web interface can be used either to prepare a 
configuration file for the stand-alone version of ALF or to 
run the simulation directly on our servers. 

For extensive simulations, we recommend using the 
stand-alone version of ALF that is available for Linux and 
Mac OS X and can be downloaded freely from the same web 
address. 



Validation 

To validate our simulation framework, we ran separate sim- 
ulations to test the various processes and models of the 
framework: 

• We ascertained that basic properties of the simulation 
were not violated. For example, without the simulation 
of gene loss, sequences should not disappear from the 
genomes. Similarly, the sequence lengths should not 
change when no indels are simulated. 

• We ensured that evolutionary distances between pairs 
of resulting sequences matched the distances from the 
input tree when estimated using the same model (sup- 
plementary fig. 5, Supplementary Material online). 

• For parametric models, we used codeml (Yang 1997) 
to reestimate the parameters of the substitution rate 
matrices used for the simulation. In all cases, the 
parameter estimates were close to the true values (sup- 
plementary table 2, Supplementary Material online for 
codon models and supplementary table 3, Supplemen- 
tary Material online for nucleotide models). 

• We compared the distribution of indel length specified 
in the simulation to the resulting distribution of gap 
lengths in the true alignment between gene pairs, using 
a x 2 ' test f° r goodness of fit. As long as there were no or 
few overlapping indel events, no significant differences 
were observed. As the proportion of overlapping indels 
increases, the length distribution of gaps becomes bi- 
ased toward longer gaps (supplementary figs. 6 and 7, 
Supplementary Material online) — an expected behav- 
ior because overlapping indels appear as a single gap. 

• In simulations with GC-content amelioration, we con- 
firmed that, when the branch lengths are sufficiently 
long, the base frequencies converge to the specified tar- 
get frequencies, both for nucleotide and codon models. 

• We ensured correctness of the Gillespie algorithm by 
comparing the distribution of waiting times between 
events to the theoretical exponential distribution us- 
ing a x 2 - test f° r goodness of fit (supplementary fig. 8, 
Supplementary Material online). 

Results and Discussion 

In order to illustrate the utility of ALF, we discuss two ex- 
ample studies below. First, we investigate the accuracy of a 
clade codon model (Bielawski and Yang 2004) by looking at 
the empirical distribution of the parameters. In the second 
study, we analyze how LGT affects prediction accuracy of 
two orthology inference methods. 

Testing Selection Regimes in Clobin Gene Family 
In vertebrates and some other species, oxygen is transported 
from lungs to tissues by means of binding with hemoglobin. 
In most vertebrates, hemoglobin is a tetramer of two pairs 
of subunits designated a and (3. In placental mammals, two 
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Table 1. ML Estimates of Model Parameters for the Globin Data Set and the "Clobin-Like" Simulated Data. 



Data Set 


Model 


Parameters Estimates 


/ 


Globin data set 


MO 


uj = 0.191 


— 2477.82 


(real data) 










M3 


u a = 0,p 0 = 0.134 


-2442.4 






uii = 0.072, p, = 0.604 








uj 2 = 0.607, p 2 = 0.262 






MD 


uio = 0.053, p 0 = 0.716 


-2435.65 






= 0.649, pi = 0.154 








ui le = 0.045, ll) 2i = 0.991, p 2 = 0.130 




ALF simulation 


MO 


u) = 0.202 




(100 replicates) 


M3 


d>o = 0.036, p 0 = 0.426 








&, = 0.229, p, = 0.374 








Q 2 = 0.761, p 2 = 0.200 






MD 


u> 0 = 0.060, p 0 = 0.673 








d)i = 0.640, p, = 0.188 








uj le = 0.191, (I) 27 = 1.150,p 2 = 0.139 





Note. — w. selective pressure (dN/dS ratio), p;: proportions of uj classes. For real data, log likelihoods are shown. MD is the preferred model according to the likelihood 
ratio test, Akaike information criterion (AIC) and Bayesian information criterion (BIC). 



paralogs (e and 7) are expressed during early development 
instead of (3. The persistence of these two genes in the 
genome since the duplication event about 80-1 00 Ma is be- 
lieved to be an example of genetic cooption, that is, the shift 
of a trait or gene to a new function. 

Selective pressures after gene duplication in the globin 
gene family has been studied using the codon model D 
(MD), and a Bayesian approach was used to detect sites that 
have experienced changes in selective pressure following the 
genetic cooption event (Bielawski and Yang 2004). Instead 
of assuming the same parameters for the whole tree, MD al- 
lows for different selective pressure in two clades of the gene 
tree following the duplication event. The selective pressure 
on the protein is measured by the dN/dS ratio {uj), which 
reflects negative selection if uj < 1 and positive selection if 
uj > 1. Among-site variation of selective pressure is mod- 
eled using k discrete site classes, one of which allows differ- 
ent uj parameters in the two a priori specified clades. Based 
on their analyses, the authors suggested that a fraction of 
sites evolves with different selective pressures in the two 
clades, with strict negative selection affecting clade e and 
relaxed purifying selection on clade 7. The complexity of 
their model, combined with the short sequence length of 
the genes in question, makes it difficult to assess the confi- 
dence in their estimate, in particular, whether the differing 
selection class for the 7 globin clade indeed corresponds to 
mild purifying, neutral, or even positive selection. Our re- 
analysis of the data (a list of genes is included in the supple- 
mentary table 1, Supplementary Material online) resulted in 
slightly different values, with uj 2i being very close to 1, which 
we attribute to the removal of columns with gaps. The maxi- 
mum likelihood (ML) estimates of the model parameters are 
summarized in table 1. 

To assess the accuracy of MD, we used ALF to simulate 
three runs of 100 "globin-like" data sets under MD (k = 3), 
fixing uj 2l = 1 and keeping all other parameters at their 
ML estimates for the original data set. We then reestimated 
the model parameters for each replicate using codeml. To 



avoid local optima, we ran codeml three times using dif- 
ferent starting values and chose the result with the high- 
est likelihood. Finally, we compared the ML estimate of uj 2i 
obtained from the real data to the distribution of ML esti- 
mates obtained from simulated replicates with uj 2j = 1. 
As can be seen from figure 2a, the ML estimate for uj 2i 
(0.991 here or 0.79 in the original study) lies well within 
the variance of this parameter. Therefore, our observations 
indicate that neutrality for this class in clade 7 cannot be re- 
jected, and there is no evidence for relaxed purifying selec- 
tion on this clade, contrary to the conclusions of Bielawski 
and Yang (2004). 

In addition, we also simulated another 1 00 replicates with 
10,000 codons to ensure that the mean estimate converges 
to the true value for long sequences. Figure 2b shows that 
this is indeed the case. 

Our results therefore suggest that MD might be too com- 
plex to give reliable estimates for the short sequences from 
the globin gene family. For the simpler models M0 and M3, 
on the other hand, this did not appear to be a problem, as 
the variances of their parameters were much smaller on the 
simulated data (mean values: table 1; distributions: supple- 
mentary figures 2 and 3, Supplementary Material online). 

Ortholog Prediction in Presence of LCT 
LGT is widely recognized as a major force in prokaryotic 
genome evolution, although the extent of LGT is still dis- 
puted (Ragan and Beiko 2009). Nevertheless, orthology pre- 
diction projects only consider vertical inheritance of genes. 

We used ALF to analyze the influence of LGT on the per- 
formance of two well-established programs for orthology 
prediction, Inparanoid 4.1 (Remm et al. 2001) and OMA 
(Roth et al. 2008). We simulated data sets with different 
amounts of gene duplications and LGT for a tree with 30 
species, sampled from the tree of 7-proteobacteria (sup- 
plementary fig. 4, Supplementary Material online). The 
root genome of each simulation consisted of 200 ran- 
domly generated sequences using amino acid frequencies 
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Fic. 2. The distribution of ML estimates for u; 27 from simulation with ALF (a) for one run with sequence length matching the real data (144codons; 
other data shown in supplementary fig. 1, Supplementary Material online), and (b) for sequences of 10,000 codons. Data simulated under MD with 
ci/27 = 1, all other parameters are as in table 1. 



from the WAG model (Whelan and Goldman 2001), which 
was also used for substitutions. Sequence lengths followed 
the r length distribution that we fitted on data from 7- 
proteobacteria. A gene loss rate was chosen that kept the 
number of genes roughly constant. We then used the re- 
sulting synthetic genomes as input for the two prediction 
pipelines. To avoid differences attributable to homology in- 
ference, we used the same procedure for both OMA and 
Inparanoid, namely all-against-all Smith-Waterman align- 
ment with score cutoff of 181 (roughly corresponding to an 
£ value of 10" 14 ). 

The results of the analysis are summarized in a precision- 
recall plot (fig. 3), where precision is defined as the frac- 
tion of true positives among predicted positives, and recall 
corresponds to the fraction of true positives recovered by 
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Fic. 3. Precision/recall of orthology predictions with different propor- 
tions of genes with a history of duplications and/or LGT. Each data 
point corresponds to the mean of five independent runs using the 
same parameters (with 95% confidence interval in both dimensions). 



a method, that is, its statistical power. For both methods, 
varying the LGT and gene duplication rates appear to have 
an almost orthogonal effect on prediction accuracy. Increas- 
ing the duplication rate mainly affects recall, decreasing the 
fraction of recovered true positives. When a larger fraction 
of genomes consists of duplicated genes, the orthology 
prediction task becomes more challenging because effects 
such as differential gene loss complicate the inference prob- 
lem. Thus, consistent with previous studies of OMA and In- 
paranoid (Altenhoff and Dessimoz 2009; Boeckmann et al. 
2011; Linard et al. 2011), we observe that these methods 
are relatively conservative in their predictions of difficult 
scenarios. 

On the other hand, increasing LGT worsens precision, re- 
ducing the fraction of true positives among predicted posi- 
tives. For both methods, it appears that laterally transferred 
genes replacing an existing sequence in the recipient species 
are more difficult to distinguish from true orthologs by ei- 
ther algorithm. Indeed, both Inparanoid and OMA have 
been developed under the assumption that the sequences 
are related through speciation and duplication events only, 
not lateral transfer events. This analysis confirms that ignor- 
ing lateral transfer events can lead to a significant fraction 
of false-positive orthology predictions. 



Conclusion 

The lack of knowledge of the evolutionary history is a ma- 
jor challenge when developing new models and methods 
in computational biology. Although a computer program 
will never be able to describe the entire evolutionary reality 
and might ignore potentially important factors, simulation 
packages have proven to be useful tools for analyzing and 
comparing the performance of new algorithms. In contrast 
to the majority of existing tools, ALF can simulate processes 
at the genomic level, rendering itself useful for a broad range 
of analyses in gene and genome evolution. With the two ex- 
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ample case studies, we illustrated what such analyses might 
look like. Other possible applications include benchmark- 
ing of alignment or tree building methods (including meth- 
ods for multiple loci or based on gene rearrangement) and 
strategies for gene and species tree reconciliation. 

Although ALF already implements more evolutionary 
models than any other publicly available simulation tool, 
the long-term goal of ALF is to realistically simulate the 
entire range of evolutionary forces that act on genomes. 
Currently, ALF is particularly well suited for simulation of 
prokaryotic evolution, where it covers many of the evolu- 
tionary processes typically relevant. And yet, many impor- 
tant aspects of evolution have yet to be implemented. As 
next steps, we plan to incorporate models of recombination 
and of promiscuous domain evolution (Basu et al. 2008). 
Possible further improvements could include models of in- 
teractions between sequences as well as of patterns such as 
codon biases or tandem repeats on the sequence level. 

As new evolutionary forces are discovered, we are confi- 
dent that these can be included in ALF, allowing for more re- 
alistic simulations and more thorough testing of algorithms. 

Supplementary Material 

Supplementary table S1 and figures S1-S4 are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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